
Data-Centric AI Competition Approach

This report is for the Data-Centric AI Competition hackathon by DeepLearning.AI and describes the steps taken to improve accuracy on the Roman MNIST dataset by improving the data instead of improving the model.
Created on September 5|Last edited on September 6

Initial observations from the data

My initial observations on the data were as follows:
  1. Some labels are incorrect (e.g. an “I” labeled as a “III”).
  2. Some pictures are noisy (a few noisy examples were shown here).
  3. There are different styles of image for a single label, e.g. “i” vs. “I”.

Initial strategy

  1. Consistent labeling: a few images are ambiguous, for example one that could be read as either a 2 or a 6.
  2. Delete noisy data: remove the noisy images described above.
  3. Define a correct split: since there are different styles of image for the same class (the numeral 1, for example, is written in three different styles), I used a labeling tool to assign metadata to the images, such as Type 1, Type 2, Type 3, and Delete. This metadata can then drive a stratified split between training and validation.
  4. Log different versions of the dataset: I used W&B to log dataset versions.
  5. Error analysis: I used W&B to log images, ground truth, and predictions to identify wrong labels and analyze them for further improvement.
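The type-aware split described above can be sketched with scikit-learn (not named in the report, so this is an assumption about tooling); the `labels` and `styles` lists below are hypothetical stand-ins for the real class labels and the Type metadata from the labeling tool:

```python
from sklearn.model_selection import train_test_split

# Hypothetical metadata: one (class, style) pair per image, where
# style plays the role of the Type 1/2/3 label from the labeling tool.
labels = [1, 1, 1, 1, 2, 2, 2, 2] * 10
styles = [1, 1, 2, 3, 1, 2, 2, 3] * 10
strata = [f"{c}-{s}" for c, s in zip(labels, styles)]
indices = list(range(len(labels)))

# Stratify on the combined class+style key so every style of every
# class appears in both the training and validation sets.
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=strata
)
```

Stratifying on the combined key is what prevents, say, all images of one writing style from landing only in training.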

First submission:

I made two submissions to check whether stratifying the split by image type has any effect. The results are below:
The stratified split scored higher both locally and on the leaderboard.

Augmentation Idea:

I observed that the images are much larger than 32 × 32, so downscaling them that far could degrade quality. The idea is to crop each image so the extra space around the character is removed.
I used the OpenCV code below to crop away the surrounding white space.
import cv2
import numpy as np

img = cv2.imread("digit.png")               # placeholder path for an input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = 255 * (gray < 128).astype(np.uint8)  # invert so the dark text becomes white
coords = cv2.findNonZero(gray)              # all non-zero (text) pixel coordinates
x, y, w, h = cv2.boundingRect(coords)       # minimal bounding box around the text
rect = img[y:y+h, x:x+w]                    # crop the original image to that box
cv2.imshow("Cropped", rect)                 # show the result
cv2.waitKey(0)
I merged the cropped images with the existing training and validation sets and obtained the score below after submission.


Use of the Augmentor library

I then used the Augmentor library with augmentations such as random rotation, random distortion, and random erasing, and also tried normalizing the images before augmentation. I received the score below:
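Random erasing, one of the augmentations mentioned above, can be sketched in plain NumPy; this is a minimal stand-in for the library's built-in version, with hypothetical parameter choices:

```python
import numpy as np

def random_erase(img, area_frac=0.2, rng=None):
    """Overwrite a random rectangle covering ~area_frac of the image with noise."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    # Side lengths of a square patch with roughly area_frac of the image area.
    eh = max(1, int(h * area_frac ** 0.5))
    ew = max(1, int(w * area_frac ** 0.5))
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = rng.integers(0, 256, size=(eh, ew), dtype=img.dtype)
    return out
```

Erasing a random patch forces the model to rely on the remaining strokes of the numeral rather than any single region.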

I had the idea of inverting the images and found this useful explanation: https://stats.stackexchange.com/questions/220164/impact-of-inverting-grayscale-values-on-mnist-dataset

Inverting images

I then inverted all the images (black pixels become white and vice versa) and added them to the mixed dataset.
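On uint8 grayscale arrays the inversion itself is a one-liner; a minimal sketch:

```python
import numpy as np

img = np.array([[0, 255], [128, 64]], dtype=np.uint8)  # toy grayscale image
inverted = 255 - img  # black (0) becomes white (255) and vice versa
```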
This last augmentation worked well and achieved 82% accuracy on the test set.

Further experiments

  1. Based on error analysis, I found that the scores for images labeled 2, 3, 7, and 8 were very low, so I augmented the data for those classes.
  2. I experimented with the augmented data (score of 0.7987) and inverted alternate images.
  3. I experimented with adding more data for classes 2, 3, 7, and 8 to the best-scoring dataset.
  4. I experimented with inverting one out of every three images.
I am still awaiting scores on these experiments.


Experiment tracking in W&B


(Embedded W&B panel: run set of 92 runs.)